Using linear predictors to impute allele frequencies from summary or pooled genotype data
Recently-developed genotype imputation methods are a powerful tool for
detecting untyped genetic variants that affect disease susceptibility in
genetic association studies. However, existing imputation methods require
individual-level genotype data, whereas, in practice, it is often the case that
only summary data are available. For example, this may occur because, for
reasons of privacy or politics, only summary data are made available to the
research community at large; or because only summary data are collected, as in
DNA pooling experiments. In this article we introduce a new statistical method
that can accurately infer the frequencies of untyped genetic variants in these
settings, and indeed substantially improve frequency estimates at typed
variants in pooling experiments where observations are noisy. Our approach,
which predicts each allele frequency using a linear combination of observed
frequencies, is statistically straightforward, and related to a long history of
the use of linear methods for estimating missing values (e.g., Kriging). The
main statistical novelty is our approach to regularizing the covariance matrix
estimates, and the resulting linear predictors, which is based on methods from
population genetics. We find that, besides being both fast and
flexible---allowing new problems to be tackled that cannot be handled by
existing imputation approaches purpose-built for the genetic context---these
linear methods are also very accurate. Indeed, imputation accuracy using this
approach is similar to that obtained by state-of-the-art imputation methods
that use individual-level data, but at a fraction of the computational cost.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/10-AOAS338
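
In its generic Kriging form, the linear predictor described above estimates an untyped frequency as the conditional mean given the typed frequencies; a sketch (the paper's specific population-genetics regularization of the covariance is left abstract here):

    \hat{f}_u = \mu_u + \Sigma_{uo} \, \Sigma_{oo}^{-1} (f_o - \mu_o)

where $f_o$ is the vector of observed (typed) frequencies, $\mu_o$ and $\mu_u$ are mean frequencies, $\Sigma_{oo}$ is the (regularized) covariance among typed variants, and $\Sigma_{uo}$ is the covariance between the untyped variant and the typed ones.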
Empirical Bayes Shrinkage and False Discovery Rate Estimation, Allowing For Unwanted Variation
We combine two important ideas in the analysis of large-scale genomics
experiments (e.g. experiments that aim to identify genes that are
differentially expressed between two conditions). The first is use of Empirical
Bayes (EB) methods to handle the large number of potentially-sparse effects,
and estimate false discovery rates and related quantities. The second is use of
factor analysis methods to deal with sources of unwanted variation such as
batch effects and unmeasured confounders. We describe a simple modular fitting
procedure that combines key ideas from both these lines of research. This
yields new, powerful EB methods for analyzing genomics experiments that account
for both sparse effects and unwanted variation. In realistic simulations, these
new methods provide significant gains in power and calibration over competing
methods. In real data analysis we find that different methods, while often
conceptually similar, can vary widely in their assessments of statistical
significance. This highlights the need for care in both choice of methods and
interpretation of results. All methods introduced in this paper are implemented
in the R package vicar, available at https://github.com/dcgerard/vicar.

Comment: 42 pages, 11 figures, 3 tables
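
A minimal usage sketch in R, assuming the mouthwash() interface exported by vicar (argument and field names here are our best reading of the package documentation and should be checked):

    # devtools::install_github("dcgerard/vicar")
    library(vicar)

    set.seed(1)
    n <- 20; p <- 1000
    X <- cbind(1, rep(0:1, each = n / 2))   # intercept + two-group label
    Y <- matrix(rnorm(n * p), n, p)         # n samples x p genes

    # EB shrinkage + FDR estimation, accounting for k factors of unwanted variation
    fit <- mouthwash(Y = Y, X = X, k = 2, cov_of_interest = 2)
    head(fit$result$lfdr)                   # local false discovery rates (field names may differ by version)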
Small World MCMC with Tempering: Ergodicity and Spectral Gap
When sampling a multi-modal distribution $\pi(x)$, $x \in \mathbb{R}^d$, a Markov
chain with local proposals is often slowly mixing; while a Small-World sampler
(Guan and Krone, 2007) -- a Markov chain that uses a mixture of local and long-range
proposals -- is fast mixing. However, a Small-World sampler suffers from the
curse of dimensionality because its spectral gap depends on the volume of each
mode. We present a new sampler that combines tempering, Small-World sampling,
and producing long-range proposals from samples in companion chains (e.g.
Equi-Energy sampler). In its simplest form the sampler employs two Small-World
chains: an exploring chain and a sampling chain. The exploring chain samples
the tempered distribution $\pi^{1/t}$, $t > 1$, and builds up an empirical
distribution. Using this empirical distribution as its long-range proposal, the
sampling chain is designed to have a stationary distribution $\pi$. We prove
ergodicity of the algorithm and study its convergence rate. We show that the
spectral gap of the exploring chain is enlarged by a temperature-dependent
factor, and that of the sampling chain is shrunk by a corresponding factor.
Importantly, the spectral gap of the exploring chain depends on the "size" of
$\pi^{1/t}$ while that of the sampling chain does not. Overall, the sampler enlarges a severe
bottleneck at the cost of shrinking a mild one, hence achieves faster mixing.
The penalty on the spectral gap of the sampling chain can be significantly
alleviated when extending the algorithm to multiple chains whose temperatures
follow a geometric progression. If we allow the temperature $t \to 0$, the
sampler becomes a global optimizer.

Comment: 24 pages, 3 figures
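
A toy one-dimensional sketch of the two-chain scheme (not the paper's implementation): the exploring chain runs random-walk MH on the tempered target $\pi^{1/t}$ and stores its samples; the sampling chain alternates local moves on $\pi$ with long-range jumps to stored samples, accepted with the equi-energy-style ratio, which treats the empirical distribution as an approximation to $\pi^{1/t}$:

    set.seed(1)
    lpi  <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))  # bimodal target
    temp <- 10                      # temperature t of the exploring chain
    iters <- 5000
    xe <- 0; xs <- 0                # exploring / sampling chain states
    store <- numeric(iters)         # empirical approximation of pi^{1/t}
    out   <- numeric(iters)

    for (i in 1:iters) {
      # exploring chain: random-walk MH on pi^{1/t} (flattened, so it crosses modes)
      ye <- xe + rnorm(1, sd = 2)
      if (log(runif(1)) < (lpi(ye) - lpi(xe)) / temp) xe <- ye
      store[i] <- xe

      # sampling chain, local move: random-walk MH on pi
      ys <- xs + rnorm(1, sd = 1)
      if (log(runif(1)) < lpi(ys) - lpi(xs)) xs <- ys

      # sampling chain, long-range move: propose a stored sample, accept with
      # the equi-energy ratio pi(y) pi(x)^{1/t} / (pi(x) pi(y)^{1/t})
      y <- store[sample.int(i, 1)]
      if (log(runif(1)) < (lpi(y) - lpi(xs)) * (1 - 1 / temp)) xs <- y

      out[i] <- xs
    }
    mean(out > 0)                   # ~0.5 once both modes are visited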
Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls
Unwanted variation, including hidden confounding, is a well-known problem in
many fields, particularly large-scale gene expression studies. Recent proposals
to use control genes --- genes assumed to be unassociated with the covariates
of interest --- have led to new methods to deal with this problem. Going by the
moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1,
RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general
framework, RUV*, that both unites and generalizes these approaches. This
unifying framework helps clarify connections between existing methods. In
particular we provide conditions under which RUV2 and RUV4 are equivalent. The
RUV* framework also preserves an advantage of RUV approaches --- their
modularity --- which facilitates the development of novel methods based on
existing matrix imputation algorithms. We illustrate this by implementing RUVB,
a version of RUV* based on Bayesian factor analysis. In realistic simulations
based on real data we found that RUVB is competitive with existing methods in
terms of both power and calibration, although we also highlight the challenges
of providing consistently reliable calibration among data sets.

Comment: 34 pages, 6 figures; methods implemented at https://github.com/dcgerard/vicar, results reproducible at https://github.com/dcgerard/ruvb_sim
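
A usage sketch in R for RUVB, assuming the ruvb() interface in vicar (argument names are our best reading of the package and should be checked against its documentation):

    library(vicar)

    set.seed(1)
    n <- 20; p <- 500
    X   <- cbind(1, rep(0:1, each = n / 2))   # design; column 2 is of interest
    Y   <- matrix(rnorm(n * p), n, p)         # n samples x p genes
    ctl <- rep(FALSE, p); ctl[1:50] <- TRUE   # first 50 genes as negative controls

    # RUV* instantiated with Bayesian factor analysis
    fit <- ruvb(Y = Y, X = X, ctl = ctl, k = 2, cov_of_interest = 2)
    str(fit, max.level = 1)                   # inspect estimated effects, uncertainty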
Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene-environment interactions
Genetic association analyses often involve data from multiple
potentially-heterogeneous subgroups. The expected amount of heterogeneity can
vary from modest (e.g., a typical meta-analysis) to large (e.g., a strong
gene--environment interaction). However, existing statistical tools are limited
in their ability to address such heterogeneity. Indeed, most genetic
association meta-analyses use a "fixed effects" analysis, which assumes no
heterogeneity. Here we develop and apply Bayesian association methods to
address this problem. These methods are easy to apply (in the simplest case,
requiring only a point estimate for the genetic effect and its standard error,
from each subgroup) and effectively include standard frequentist meta-analysis
methods, including the usual "fixed effects" analysis, as special cases. We
apply these tools to two large genetic association studies: one a meta-analysis
of genome-wide association studies from the Global Lipids consortium, and the
second a cross-population analysis for expression quantitative trait loci
(eQTLs). In the Global Lipids data we find, perhaps surprisingly, that effects
are generally quite homogeneous across studies. In the eQTL study we find that
eQTLs are generally shared among different continental groups, and discuss
consequences of this for study design.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/13-AOAS695
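
As an illustration of working only from per-subgroup summary data (a point estimate and its standard error), here is a fixed-effects combination plus a Wakefield-style approximate Bayes factor in R; this is a generic sketch, not the paper's specific prior family:

    # Per-subgroup summary data: effect estimates and standard errors
    beta <- c(0.12, 0.18, 0.09)
    se   <- c(0.05, 0.06, 0.07)

    # Fixed-effects (inverse-variance) meta-analysis, which assumes no heterogeneity
    w       <- 1 / se^2
    beta_fe <- sum(w * beta) / sum(w)
    se_fe   <- sqrt(1 / sum(w))

    # Wakefield's approximate Bayes factor in favor of association,
    # with a user-chosen prior effect variance W (here 0.2^2)
    abf <- function(b, s, W = 0.04) {
      z <- b / s; V <- s^2
      sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W))
    }
    abf(beta_fe, se_fe)   # evidence for association vs. the null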
Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies
Multivariate linear mixed models (mvLMMs) have been widely used in many areas
of genetics, and have attracted considerable recent interest in genome-wide
association studies (GWASs). However, fitting mvLMMs is computationally
non-trivial, and no existing method is computationally practical for performing
the likelihood ratio test (LRT) for mvLMMs in GWAS settings with moderate
sample size n. The existing software MTMM performs an approximate LRT for two
phenotypes and, as we find, its p values can substantially understate the
significance of associations. Here, we present novel computationally-efficient
algorithms for fitting mvLMMs, and computing the LRT in GWAS settings. After a
single initial eigen-decomposition (with complexity O(n^3)), the algorithms (i)
reduce computational complexity (per iteration of the optimizer) from cubic to
linear in n; and (ii) in GWAS analyses, reduce per-marker complexity from cubic
to quadratic in n. These innovations make it practical to compute the LRT for
mvLMMs in GWASs for tens of thousands of samples and a moderate number of
phenotypes (~2-10). With simulations, we show that the LRT provides correct
control for type I error. With both simulations and real data we find that the
LRT is more powerful than the approximate LRT from MTMM, and illustrate the
benefits of analyzing more than two phenotypes. The method is implemented in
the GEMMA software package, freely available at
http://stephenslab.uchicago.edu/software.htm
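
The speedups rest on rotating the data by the eigenvectors of the relatedness matrix so that the covariance becomes diagonal. A univariate R sketch of this standard trick (the paper's contribution is the multivariate, multi-phenotype version implemented in GEMMA):

    set.seed(1)
    n <- 200
    K <- crossprod(matrix(rnorm(n * n), n)) / n     # positive semi-definite kinship stand-in
    y <- as.vector(t(chol(0.5 * K + 0.5 * diag(n))) %*% rnorm(n))

    # One O(n^3) eigen-decomposition: K = U D U'
    ed <- eigen(K, symmetric = TRUE)
    U  <- ed$vectors; d <- ed$values

    # Rotate: cov(U'y) = sg2 * D + se2 * I is diagonal, so evaluating the
    # log-likelihood at any (sg2, se2) costs O(n) per iteration, not O(n^3)
    yr <- crossprod(U, y)
    loglik <- function(sg2, se2) {
      v <- sg2 * d + se2
      -0.5 * sum(log(v)) - 0.5 * sum(yr^2 / v) - 0.5 * n * log(2 * pi)
    }
    loglik(0.5, 0.5)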
Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease
Many common diseases are highly polygenic, modulated by a large number of
genetic factors with small effects on susceptibility to disease. These small
effects are difficult to map reliably in genetic association studies. To
address this problem, researchers have developed methods that aggregate
information over sets of related genes, such as biological pathways, to
identify gene sets that are enriched for genetic variants associated with
disease. However, these methods fail to answer a key question: which genes and
genetic variants are associated with disease risk? We develop a method based on
sparse multiple regression that simultaneously identifies enriched pathways,
and prioritizes the variants within these pathways, to locate additional
variants associated with disease susceptibility. A central feature of our
approach is an estimate of the strength of enrichment, which yields a coherent
way to prioritize variants in enriched pathways. We illustrate the benefits of
our approach in a genome-wide association study of Crohn's disease with
~440,000 genetic variants genotyped for ~4700 study subjects. We obtain strong
support for enrichment of IL-12, IL-23 and other cytokine signaling pathways.
Furthermore, prioritizing variants in these enriched pathways yields support
for additional disease-associated variants, all of which have been
independently reported in other case-control studies for Crohn's disease.

Comment: Submitted to PLoS Genetics
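
The enrichment idea can be written as a hierarchical prior on variant inclusion in the sparse regression; a sketch consistent with the description above (notation ours):

    \gamma_j \sim \mathrm{Bernoulli}(\pi_j), \qquad
    \log_{10} \frac{\pi_j}{1 - \pi_j} = \theta_0 + \theta \, a_j,

where $\gamma_j$ indicates whether variant $j$ enters the regression, $a_j = 1$ if variant $j$ lies in the candidate pathway (and 0 otherwise), $\theta_0$ sets the background inclusion rate, and the enrichment parameter $\theta > 0$ raises the prior odds of inclusion for variants in the pathway.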
Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays
Understanding how genetic variants influence cellular-level processes is an
important step towards understanding how they influence important
organismal-level traits, or "phenotypes", including human disease
susceptibility. To this end scientists are undertaking large-scale genetic
association studies that aim to identify genetic variants associated with
molecular and cellular phenotypes, such as gene expression, transcription
factor binding, or chromatin accessibility. These studies use high-throughput
sequencing assays (e.g. RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution
data on how the traits vary along the genome in each sample. However, typical
association analyses fail to exploit these high-resolution measurements,
instead aggregating the data at coarser resolutions, such as genes, or windows
of fixed length. Here we develop and apply statistical methods that better
exploit the high-resolution data. The key idea is to treat the sequence data as
measuring an underlying "function" that varies along the genome, and then,
building on wavelet-based methods for functional data analysis, test for
association between genetic variants and the underlying function. Applying
these methods to identify genetic variants associated with chromatin
accessibility (dsQTLs) we find that they identify substantially more
associations than a simpler window-based analysis, and in total we identify 772
novel dsQTLs not identified by the original analysis.
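
A toy R sketch of the wavelet idea (generic, not the paper's full model): transform each individual's base-level signal with a Haar wavelet, then test each wavelet coefficient for association with genotype:

    set.seed(1)
    n <- 100; B <- 64                      # individuals, base positions (power of 2)
    geno <- rbinom(n, 2, 0.3)              # genotypes at the candidate variant
    sig  <- matrix(rnorm(n * B), n, B)     # per-individual signal along the region
    sig[, 17:24] <- sig[, 17:24] + 0.5 * geno   # genotype shifts one sub-region

    # Orthonormal Haar wavelet transform, hand-rolled for self-containment
    haar <- function(x) {
      out <- numeric(0)
      while (length(x) > 1) {
        o <- x[seq(1, length(x), 2)]; e <- x[seq(2, length(x), 2)]
        out <- c((o - e) / sqrt(2), out)   # detail coefficients at this scale
        x   <- (o + e) / sqrt(2)           # smooth coefficients, next scale
      }
      c(x, out)                            # overall level first
    }

    W <- t(apply(sig, 1, haar))            # n x B matrix of wavelet coefficients
    pvals <- apply(W, 2, function(w) summary(lm(w ~ geno))$coefficients[2, 4])
    which(p.adjust(pvals, "BH") < 0.05)    # coefficients associated with genotype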
varbvs: Fast Variable Selection for Large-scale Regression
We introduce varbvs, a suite of functions written in R and MATLAB for
regression analysis of large-scale data sets using Bayesian variable selection
methods. We have developed numerical optimization algorithms based on
variational approximation methods that make it feasible to apply Bayesian
variable selection to very large data sets. With a focus on examples from
genome-wide association studies, we demonstrate that varbvs scales well to data
sets with hundreds of thousands of variables and thousands of samples, and has
features that facilitate rapid data analyses. Moreover, varbvs allows for
extensive model customization, which can be used to incorporate external
information into the analysis. We expect that the combination of an easy-to-use
interface and robust, scalable algorithms for posterior computation will
encourage broader use of Bayesian variable selection in areas of applied
statistics and computational biology. The most recent R and MATLAB source code
is available for download at Github (https://github.com/pcarbo/varbvs), and the
R package can be installed from CRAN
(https://cran.r-project.org/package=varbvs).

Comment: 31 pages, 6 figures
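
A small usage example in R on simulated data; the varbvs() call follows the package's documented interface, though output field names may vary across versions:

    library(varbvs)

    set.seed(1)
    n <- 500; p <- 2000
    X <- matrix(rnorm(n * p), n, p)            # e.g. centered genotypes
    beta <- rep(0, p); beta[1:10] <- 1         # 10 truly nonzero effects
    y <- as.vector(X %*% beta + rnorm(n))

    # Bayesian variable selection fit via variational approximation
    fit <- varbvs(X, Z = NULL, y = y, family = "gaussian")
    summary(fit)
    head(fit$pip)                              # posterior inclusion probabilities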
Flexible signal denoising via flexible empirical Bayes shrinkage
Signal denoising---also known as non-parametric regression---is often
performed through shrinkage estimation in a transformed (e.g., wavelet) domain;
shrinkage in the transformed domain corresponds to smoothing in the original
domain. A key question in such applications is how much to shrink, or,
equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an
attractive solution to this problem; they use the data to estimate a
distribution of underlying "effects", hence automatically select an appropriate
amount of shrinkage. However, most existing implementations of Empirical Bayes
shrinkage are less flexible than they could be---both in their assumptions on
the underlying distribution of effects, and in their ability to handle
heteroskedasticity---which limits their signal denoising applications. Here we
address this by taking a particularly flexible, stable and computationally
convenient Empirical Bayes shrinkage method, and we apply it to several signal
denoising problems. These applications include smoothing of Poisson data and
heteroskedastic Gaussian data. We show through empirical comparisons that the
results are competitive with other methods, including both simple thresholding
rules and purpose-built Empirical Bayes procedures. Our methods are implemented
in the R package smashr, "SMoothing by Adaptive SHrinkage in R," available at
https://www.github.com/stephenslab/smash
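
A minimal usage sketch in R, assuming the smash() interface in smashr as we recall it (install from the repository above; check the package documentation for exact arguments):

    # devtools::install_github("stephenslab/smashr")
    library(smashr)

    set.seed(1)
    n  <- 1024                                        # smash expects a power-of-2 length
    mu <- sin(2 * pi * (1:n) / n) * ((1:n) > n / 2)   # spatially varying mean
    x  <- mu + rnorm(n, sd = 0.5)                     # noisy Gaussian observations

    mu_hat <- smash(x)                                # EB shrinkage in the wavelet domain
    plot(x, col = "grey"); lines(mu_hat, lwd = 2)     # denoised estimate over the data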